Name: Ravisankar Chengannagari
Name: Ashish Bhandari

Analyzing Air Quality Trends: A Comparative Study of New York Neighborhoods and Global Standards

Abstract

Air pollution significantly impacts global health, ecosystems, and economies, necessitating vigilant monitoring and predictive analysis. This study addresses this concern by analyzing air quality trends in New York City and globally, with a focus on Nitrogen dioxide (NO2) levels in New York and PM2.5 concentrations worldwide. The research aims to identify historical trends, understand regional disparities, and examine contributing factors to air pollution, thereby informing policy decisions.

Data is sourced from authoritative platforms, including the NYC Open Data platform and the World Health Organization, utilizing both API access and direct downloads. The methodology encompasses data collection, thorough cleaning and preprocessing, exploratory data analysis, and advanced statistical techniques such as correlation and regression analysis. Geospatial visualization tools highlight pollution hotspots, facilitating easy comparison across regions.

The study reveals key insights into air quality trends, showcasing regional disparities and identifying urban areas with critical air quality issues. The comparative analysis sheds light on PM2.5 concentrations in New York City relative to other major urban centers, uncovering significant findings relevant for urban planning and environmental policies. This project offers a data-driven foundation for understanding air quality trends and contributes to informed decision-making in environmental management. The outcomes highlight the importance of effective air quality management policies and provide a framework for future research in this domain.

Introduction

Air pollution is a significant global issue with serious consequences for public health, ecosystems, and economies. As environmental degradation continues to be a concern, the need for effective monitoring and analysis of air quality trends becomes increasingly crucial. This project aims to study Nitrogen dioxide (NO2) levels in New York City and PM2.5 concentrations worldwide, utilizing advanced statistical and machine learning methods to identify trends, understand regional disparities, and evaluate the extent of air pollution. The goal is to provide a solid foundation for data-driven policy-making and to contribute to improving environmental conditions.

The motivation behind this research is driven by a multifaceted concern for the well-being of both people and the planet. Air pollution poses serious threats to public health, leading to respiratory and cardiovascular diseases that strain healthcare systems and diminish quality of life. Understanding and addressing pollutants such as Nitrogen dioxide (NO2) and PM2.5 is critical because of their detrimental health effects.

Moreover, New York City, as a major economic hub, exemplifies how air pollution can impact economic vitality. The city's industrial activity, vehicular emissions, and urbanization contribute to poor air quality, which, in turn, affects workforce productivity and healthcare costs. By examining trends and disparities in air quality, this research aims to inform policies that improve environmental conditions and enhance economic resilience.

From a global perspective, air pollution not only impacts diverse regions but also highlights the need for effective interventions and sustainable urban planning. The project seeks to provide insights that address critical scientific and business concerns, supporting targeted interventions for healthier environments.

In addition to public health and economic concerns, this research is motivated by the desire to enhance environmental management and support businesses in mitigating pollution-related risks. By contributing to corporate social responsibility initiatives and fostering the development of environmental technologies, this study aligns with broader efforts to create sustainable urban environments that benefit both scientific understanding and economic interests.

The research questions formulated for this project aim to provide a comprehensive understanding of air quality trends and factors affecting air pollution, both locally in New York City and globally. Here’s a detailed explanation of each question:

Research Question 1: Trend Analysis

What are the historical trends in air quality within New York and on a global scale, and how do these trends compare over time?

Objective: To ascertain the historical trends in air quality within New York City and globally, and to compare these trends over time to identify patterns of improvement or deterioration.

Rationale: This analysis provides an in-depth look at how air quality has evolved over the years, both locally and internationally. By examining data on NO2 and PM2.5 levels over extended periods, we can pinpoint periods of significant change which may correlate with specific regulatory actions, industrial growth, or urban development phases. This detailed temporal mapping helps identify when and where air quality initiatives have been successful or where they have failed, offering a historical perspective that enhances future air quality forecasting and management strategies.

Research Question 2: Extremes in Air Quality

Among global regions, which exhibit the highest and lowest levels of air pollution, and what insights can we gain about the factors contributing to these extremes?

Objective: To identify which global regions exhibit the highest and lowest levels of air pollution and to explore the factors contributing to these extremes.

Rationale: Understanding the extremes in air quality across different regions illuminates the most and least polluted areas, prompting a deeper investigation into the environmental, socioeconomic, and policy conditions that lead to such disparities. By studying areas with extreme pollution, we can explore specific local factors like heavy industrialization, lax regulation, or geographical and meteorological conditions that contribute to poor air quality. Conversely, regions with exceptionally clean air provide insights into successful environmental practices and regulations, which can serve as models for other areas.

Research Question 3: Local Analysis

Within New York, which neighborhood stands out for having the worst air quality, and what are the potential local contributors to this status?

Objective: To determine which neighborhoods in New York City experience the worst air quality and to investigate potential local contributors to this status.

Rationale: This question delves into the granular impacts of air pollution at the neighborhood level within a major metropolitan area. By identifying the most polluted neighborhoods, the analysis can focus on localized sources of pollution such as traffic congestion, specific industrial activities, or lack of green spaces. Understanding these factors allows for the development of localized interventions that can directly address the sources of pollution in a targeted manner, potentially leading to more effective solutions.

Research Question 4: Comparative Analysis of PM2.5 Concentrations

How do PM2.5 concentrations in New York City compare to other major urban areas around the world?

Objective: To compare PM2.5 concentrations in New York City with those in other major urban areas around the world.

Rationale: This comparison sheds light on how New York City's air quality management strategies measure up against those implemented in other global cities facing similar challenges. By examining PM2.5 levels across different cities, we can assess the relative success of air quality controls and urban planning strategies. This analysis not only highlights areas where New York City may need to bolster its efforts but also provides an opportunity to learn from the successes of other cities. This comparative perspective is essential for adopting best practices and innovative solutions in air quality management.

In this project, we derive data from two critical sources that provide comprehensive and authoritative datasets on air quality, enabling a detailed analysis of both local and global pollution levels. The specific characteristics of each source are elaborated below to emphasize their relevance and reliability in supporting our air quality research.

Research Approach

In our final project, we have adopted a systematic research approach to thoroughly analyze air quality trends in New York City and globally. This approach encompasses several distinct yet interconnected phases: data acquisition, data management, data preparation, exploratory data analysis (EDA), and investigative analysis. Each phase is designed to ensure that our findings are both robust and insightful, supporting effective policy recommendations.

Data Acquisition

Our research project's success critically hinges on the systematic acquisition of high-quality, authoritative data concerning air quality. This section details the meticulous process involved in identifying, selecting, and acquiring the necessary datasets from two primary sources: NYC Open Data and the World Health Organization (WHO). Each step in this process is crafted to ensure that the data not only meets our specific research needs but also adheres to the highest standards of data integrity and reliability.

Initial Data Search:
The search for appropriate data sources began with a comprehensive review of available environmental data repositories that provide open access to air quality measurements. Our criteria for selection included data comprehensiveness, update frequency, geographical specificity, and the reliability of the source. After evaluating several potential sources, NYC Open Data and the WHO were identified as the most suitable for providing the detailed and reliable datasets required for our analysis.

Evaluation of Data Quality and Relevance:
Before finalizing our choice of data sources, we conducted a preliminary assessment of the data quality. This involved reviewing the data collection methodologies used by each source, the frequency of data updates, and the historical depth of the datasets. Both chosen platforms demonstrated robust data collection protocols and provided extensive documentation on their methodologies, ensuring the datasets' relevance and reliability for our study.

Methods of Data Acquisition

NYC Open Data:

  • Platform Use: Utilizing the NYC Open Data platform involved accessing their comprehensive API, which provides real-time data feeds and historical data access. This API facilitates the integration of live data streams into our analysis tools, enabling up-to-date and longitudinal studies of NO2 levels across New York City.

  • API Integration: The integration process included setting up API calls tailored to retrieve air quality data specific to our geographic and pollutant criteria. The data is then pulled automatically into our PostgreSQL database, which is configured to handle large volumes of time-series data efficiently; a minimal sketch of this step follows.
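The sketch below is illustrative only: the Socrata dataset identifier, the field names, and the target table are assumptions for the sake of example, not necessarily the exact values used in our pipeline.

#fetch a page of NO2 records from the NYC Open Data (Socrata) API
import requests
import psycopg2

URL = "https://data.cityofnewyork.us/resource/c3uy-2p5r.json"  #dataset id assumed
resp = requests.get(URL, params = {"$limit": 1000, "name": "Nitrogen dioxide (NO2)"})
resp.raise_for_status()
records = resp.json()

#insert the records into a PostgreSQL table (table and columns assumed to exist)
con = psycopg2.connect(dbname = 'airquality', user = 'postgres',
                       password = '<password>', host = 'localhost', port = '5432')
cur = con.cursor()
for rec in records:
    cur.execute(
        "INSERT INTO nyc_no2 (unique_id, geo_place_name, start_date, data_value) "
        "VALUES (%s, %s, %s, %s)",
        (rec.get('unique_id'), rec.get('geo_place_name'),
         rec.get('start_date'), rec.get('data_value'))
    )
con.commit()
cur.close()
con.close()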

World Health Organization (WHO):

  • Direct Data Download: Accessing global air quality data from the WHO involved navigating their Global Health Observatory data repository. We utilized direct download capabilities to obtain structured datasets containing historical and current global PM2.5 levels.
  • Data Structuring and Storage: Post-download, the data was systematically structured into a format compatible with our analytical tools. This process ensured that the global dataset could be seamlessly compared and analyzed alongside the NO2 data from NYC.
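As a sketch of this structuring step (the file name is an assumption; the table name matches the air_quality table queried later in the notebook), the downloaded CSV can be normalized and loaded with pandas and SQLAlchemy:

import pandas as pd
from sqlalchemy import create_engine

#load the WHO Global Health Observatory download (file name assumed)
who = pd.read_csv('who_pm25_sdg.csv')

#lowercase and strip the column names so they match the PostgreSQL schema
who.columns = who.columns.str.strip().str.lower().str.replace(' ', '')

#write the structured data into the airquality database
engine = create_engine('postgresql+psycopg2://postgres:<password>@localhost:5432/airquality')
who.to_sql('air_quality', engine, if_exists = 'replace', index = False)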

Data Compliance and Ethical Considerations

In conducting this research, special attention was given to ensuring compliance with legal standards and ethical guidelines related to data usage. The datasets utilized from NYC Open Data and the World Health Organization (WHO) are publicly available and explicitly provided for research and analysis purposes, thus ensuring legal compliance.

Legal and Open Access Compliance:

NYC Open Data:

  • Public Availability: The data provided by NYC Open Data is intended for public use. The platform encourages the utilization of its data to promote transparency, innovation, and community engagement.
  • Usage Policy: NYC Open Data’s terms of use do not restrict academic and non-commercial use of the data. This allows researchers free access to utilize the data within the bounds of legal and ethical academic research standards.

World Health Organization (WHO):

  • Public Availability: WHO offers global health-related data under open-access terms to support health research worldwide. The data is designed to be freely available to promote international health development and monitoring.
  • Usage Policy: The WHO's data usage policy permits the use of its data for educational and research purposes, aligning perfectly with our project's objectives. The organization promotes the dissemination of information to improve global health outcomes, making the data ideal for our analytical needs.

Ethical Considerations:

  • Data Integrity: We maintain high standards of data integrity, ensuring that the data is represented accurately in our analyses and reports. This involves careful handling of data during transfer, storage, and processing stages to prevent data corruption or loss.
  • Accurate Representation: We commit to accurately representing our data findings without manipulation or bias. The data's integrity is paramount in drawing conclusions and making recommendations based on our analysis.

The data acquisition strategy outlined herein reflects a thorough and deliberate approach to sourcing, integrating, and managing critical environmental data. This foundational work is vital for empowering our subsequent exploratory and investigative analyses, ultimately enabling a nuanced understanding of air quality trends and informing effective policy interventions.

Data Management and Storage
In our research project examining air quality trends in New York City and globally, effective data management and storage are crucial to ensure data integrity, facilitate efficient analysis, and support the scalability of the project. This section elaborates on our sophisticated approach to managing and securely storing large volumes of environmental data, utilizing advanced database technologies and adhering to best practices.

Data Storage Infrastructure

  • Database System: As required for this project, PostgreSQL is used for its capabilities in handling complex datasets and its support for extensive data operations, which are essential for environmental data analysis.
  • Database Architecture: Our PostgreSQL database is structured specifically to optimize the storage and retrieval of air quality data. It features separate schemas for local NO2 data from NYC Open Data and global PM2.5 data from WHO, which helps in keeping the data organized and manageable.
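As an illustration of this layout, the schemas and a time-series table could be created with DDL along the following lines; all schema, table, and column names here are assumptions, not our exact production schema.

import psycopg2

con = psycopg2.connect(dbname = 'airquality', user = 'postgres',
                       password = '<password>', host = 'localhost', port = '5432')
cur = con.cursor()

#separate schemas keep the local and global datasets organized
cur.execute("CREATE SCHEMA IF NOT EXISTS nyc;")
cur.execute("CREATE SCHEMA IF NOT EXISTS who;")

#time-series table for NYC NO2 readings (columns illustrative)
cur.execute("""
    CREATE TABLE IF NOT EXISTS nyc.no2_readings (
        unique_id      BIGINT PRIMARY KEY,
        geo_place_name TEXT,
        start_date     DATE,
        data_value     NUMERIC
    );
""")
#an index on the date column speeds up the time-series queries used in EDA
cur.execute("CREATE INDEX IF NOT EXISTS idx_no2_date ON nyc.no2_readings (start_date);")

con.commit()
cur.close()
con.close()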

Data Management Processes

Data is automatically ingested from the NYC Open Data API and WHO datasets via scripts that run at predetermined intervals. This ensures that our database is consistently updated with the most recent data without manual intervention.
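A minimal stand-in for such an interval-based script is sketched below; in practice a cron job or workflow scheduler would own the timing, and ingest_all is a hypothetical wrapper around the API pull and WHO refresh steps sketched earlier.

import time

def ingest_all():
    #hypothetical wrapper: pull the latest NYC API records and refresh the WHO table
    ...

#re-run the ingestion once a day without manual intervention
while True:
    ingest_all()
    time.sleep(24 * 60 * 60)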

Use of PostgreSQL for data management in this project provides a strong foundation for handling the complex and sensitive data involved in air quality research. The structured approach to data management and storage ensures that our analysis is supported by data that is not only secure and well-managed but also consistently reliable and accessible. This infrastructure is crucial for delivering accurate, actionable insights into air quality trends and for supporting the broader objectives of environmental health research.

Data Preparation

The data preparation phase of our research project is meticulously designed to ensure the integrity and quality of the data used for our analysis of air quality trends in New York City and globally. This crucial stage sets the groundwork for accurate and reliable results, adhering to high standards. The following details the comprehensive steps undertaken to clean, preprocess, and prepare the datasets obtained from NYC Open Data and the World Health Organization (WHO).

Data Cleaning:

  • Identifying and Addressing Missing Values: Initial steps involve a thorough examination of the datasets to identify any missing entries. Appropriate strategies are employed to handle these missing values.
  • Removing Duplicates: We ensure the uniqueness of each data entry by identifying and removing duplicate records. This step is vital for preventing any skewness or bias in the analysis that could arise from redundant data.
  • Error Correction: The datasets are scrutinized for any anomalies or erroneous entries. Corrections are made to rectify any identified errors, and outliers are evaluated to determine whether they represent true extremes in the data or mistakes that require adjustment.
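In pandas terms, these cleaning steps reduce to a handful of operations. The sketch below uses the NYC dataset's actual column names; the IQR rule for flagging outliers is one common choice, not the only one.

import pandas as pd

df = pd.read_csv('Air_Quality.csv')

#missing values: inspect the counts, then drop or impute depending on their scale
print(df.isnull().sum())
df = df.dropna(subset = ['Geo Join ID', 'Geo Place Name'])

#duplicates: keep exactly one copy of each record
df = df.drop_duplicates()

#outliers: flag values far outside the interquartile range for manual review
q1, q3 = df['Data Value'].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df['Data Value'] < q1 - 1.5 * iqr) | (df['Data Value'] > q3 + 1.5 * iqr)]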

The focused and rigorous data cleaning process employed in this project forms the foundation for all subsequent analyses. By ensuring that our datasets are free from inaccuracies and inconsistencies, we establish a robust base for exploring air quality trends. This meticulous attention to data integrity not only enhances the credibility of our research but also ensures that the insights derived are based on the most reliable data available. As such, our data cleaning efforts are crucial in enabling informed, data-driven conclusions essential for understanding and mitigating air pollution effectively.

Exploratory Data Analysis (EDA)

Exploratory Data Analysis (EDA) is a fundamental component of our research project, providing the initial deep dive into the air quality data collected from New York City and globally. EDA helps uncover underlying patterns, identify anomalies, and gain a thorough understanding of the dataset's characteristics, which are essential for guiding further statistical analysis and predictive modeling.

Objectives of EDA

The primary objectives of our EDA are to:

  1. Understand Data Characteristics: The fundamental goal here is to grasp the basic properties of the data, which includes understanding the distribution, central tendencies (mean, median), variability (standard deviation, range), and the presence of any anomalies or outliers in the key variables such as NO2 and PM2.5 concentrations.
  2. Identify Patterns and Anomalies: This objective focuses on detecting any trends, cycles, or patterns in the data that could indicate underlying behaviors or seasonal effects. Additionally, identifying anomalies or outliers is crucial for recognizing data errors or extraordinary events that could significantly influence the analysis.
  3. Prepare for Further Analysis: EDA serves to lay the groundwork for more detailed statistical analysis and predictive modeling. By exploring the data initially, researchers can form hypotheses about relationships within the data, decide on appropriate modeling techniques, and identify areas that require more focused analysis.

EDA Techniques Employed

  1. Data Visualization: We employ a range of visualization tools to inspect the data visually. Histograms, box plots, and time series plots allow us to examine the distribution and temporal changes in pollutant levels. Scatter plots help in exploring potential relationships between various environmental factors. Geographic visualization techniques, including heat maps, are utilized to assess the spatial distribution of pollutants across different regions. This helps in identifying areas with higher pollution levels and examining spatial trends.
  2. Statistical Summaries: Comprehensive descriptive statistics are calculated for each critical variable to capture central tendencies, dispersion, and other statistical properties. This includes measures like mean, median, standard deviation, and interquartile ranges. Correlation analyses are performed to determine the relationships between different pollutants and between pollutants and potential influencing factors such as meteorological conditions or traffic data.
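As a concrete sketch of these summaries (using the NYC dataframe loaded later in the notebook):

#descriptive statistics for the PM 2.5 subset
pm25 = data[data['Name'] == 'Fine particles (PM 2.5)']['Data Value']
print(pm25.describe())   #count, mean, std, quartiles, min/max
print(pm25.median())     #central tendency robust to outliers

#pairwise correlations between the numeric columns, e.g. pollutant level vs. year
numeric = data.select_dtypes('number')
print(numeric.corr())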

Tools and Technologies Used

  • Python: We utilize Python for its robust data handling and analysis capabilities. Libraries such as Pandas facilitate data manipulation, while Matplotlib and Seaborn support the creation of visualizations.

Quality Assurance in EDA

  • We consistently verify the accuracy and completeness of the data throughout the EDA process. This ensures that any insights or patterns identified are based on reliable data.
  • EDA is approached as an iterative process, where initial findings lead to further questions and deeper investigations. Adjustments are made continuously to the analysis based on emerging insights and feedback.

This groundwork supports more sophisticated analyses and helps us draw meaningful, data-driven conclusions that can inform policy decisions and contribute to the broader discourse on environmental health and air quality management.

In conclusion, our comprehensive research approach, meticulously designed across multiple phases—Data Acquisition, Data Preparation, Exploratory Data Analysis, and Investigative Analysis—ensures the integrity and depth of our analysis of air quality trends in New York City and globally. This methodological rigor facilitates a seamless transition from data collection to deep analytical insights, enabling us to address complex environmental questions with precision. Through this structured progression, we uphold stringent academic and ethical standards, ensuring that our findings are not only scientifically robust but also of practical relevance to policy making and public health. Ultimately, our research approach is aimed at providing actionable insights that can significantly impact environmental management and urban planning in the context of air quality improvement.

In [1]:
#imported the required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import psycopg2
import http.client
import folium
import plotly.graph_objs as go
import plotly.offline as pyo
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR
import warnings
warnings.filterwarnings('ignore')
In [2]:
#read the csv file stored in a github repository
data = pd.read_csv('https://raw.githubusercontent.com/ravi2248/AIM-5001/main/Air_Quality.csv')
In [3]:
#we can see the top 5 rows
data.head()
Out[3]:
Unique ID Indicator ID Name Measure Measure Info Geo Type Name Geo Join ID Geo Place Name Time Period Start_Date Data Value Message
0 179772 640 Boiler Emissions- Total SO2 Emissions Number per km2 number UHF42 409.0 Southeast Queens 2015 01/01/2015 0.3 NaN
1 179785 640 Boiler Emissions- Total SO2 Emissions Number per km2 number UHF42 209.0 Bensonhurst - Bay Ridge 2015 01/01/2015 1.2 NaN
2 178540 365 Fine particles (PM 2.5) Mean mcg/m3 UHF42 209.0 Bensonhurst - Bay Ridge Annual Average 2012 12/01/2011 8.6 NaN
3 178561 365 Fine particles (PM 2.5) Mean mcg/m3 UHF42 409.0 Southeast Queens Annual Average 2012 12/01/2011 8.0 NaN
4 823217 365 Fine particles (PM 2.5) Mean mcg/m3 UHF42 409.0 Southeast Queens Summer 2022 06/01/2022 6.1 NaN
In [4]:
#we can see the bottom 5 rows
data.tail()
Out[4]:
Unique ID Indicator ID Name Measure Measure Info Geo Type Name Geo Join ID Geo Place Name Time Period Start_Date Data Value Message
18020 816914 643 Annual vehicle miles traveled Million miles per square mile CD 503.0 Tottenville and Great Kills (CD3) 2019 01/01/2019 12.9 NaN
18021 816913 643 Annual vehicle miles traveled Million miles per square mile CD 503.0 Tottenville and Great Kills (CD3) 2010 01/01/2010 14.7 NaN
18022 816872 643 Annual vehicle miles traveled Million miles per square mile UHF42 208.0 Canarsie - Flatlands 2010 01/01/2010 43.4 NaN
18023 816832 643 Annual vehicle miles traveled Million miles per square mile UHF42 407.0 Southwest Queens 2010 01/01/2010 65.8 NaN
18024 151658 643 Annual vehicle miles traveled Million miles per square mile UHF42 408.0 Jamaica 2005 01/01/2005 41.0 NaN
In [5]:
#it will show the number of columns and rows in the dataset
data.shape
Out[5]:
(18025, 12)
In [6]:
#we can see the column details
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18025 entries, 0 to 18024
Data columns (total 12 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Unique ID       18025 non-null  int64  
 1   Indicator ID    18025 non-null  int64  
 2   Name            18025 non-null  object 
 3   Measure         18025 non-null  object 
 4   Measure Info    18025 non-null  object 
 5   Geo Type Name   18025 non-null  object 
 6   Geo Join ID     18016 non-null  float64
 7   Geo Place Name  18016 non-null  object 
 8   Time Period     18025 non-null  object 
 9   Start_Date      18025 non-null  object 
 10  Data Value      18025 non-null  float64
 11  Message         0 non-null      float64
dtypes: float64(3), int64(2), object(7)
memory usage: 1.7+ MB
In [7]:
#here we are connecting to PostgreSQL
#to read the data present in the database
#the password is read from the PG_PASSWORD environment variable rather than hardcoded
import os

con = psycopg2.connect(
    dbname = 'airquality',
    user = 'postgres',
    password = os.getenv('PG_PASSWORD'),
    host = 'localhost',
    port = '5432'
)
In [8]:
#this query will return all the data stored in postgresql
query = "SELECT * FROM air_quality"
In [9]:
#we use pandas' read_sql function to read the query results into a dataframe
global_data = pd.read_sql(query, con)
In [10]:
#we close the connection after loading the data
con.close()
In [11]:
#we can see the top 5 rows
global_data.head()
Out[11]:
indicatorcode indicator parentlocationcode parentlocation locationtype spatialdimvaluecode location periodtype period islatestyear dim1type dim1 dim1valuecode factvaluenumeric factvaluenumericlow factvaluenumerichigh value
0 SDGPM25 Concentrations of fine particulate matter (PM2.5) AFR Africa Country KEN Kenya Year 2019 True Residence Area Type Cities RESIDENCEAREATYPE_CITY 10.01 6.29 13.74 10.01 [6.29-13.74]
1 SDGPM25 Concentrations of fine particulate matter (PM2.5) AMR Americas Country TTO Trinidad and Tobago Year 2019 True Residence Area Type Rural RESIDENCEAREATYPE_RUR 10.02 7.44 12.55 10.02 [7.44-12.55]
2 SDGPM25 Concentrations of fine particulate matter (PM2.5) EUR Europe Country GBR United Kingdom of Great Britain and Northern I... Year 2019 True Residence Area Type Cities RESIDENCEAREATYPE_CITY 10.06 9.73 10.39 10.06 [9.73-10.39]
3 SDGPM25 Concentrations of fine particulate matter (PM2.5) AMR Americas Country GRD Grenada Year 2019 True Residence Area Type Total RESIDENCEAREATYPE_TOTL 10.08 7.07 13.20 10.08 [7.07-13.20]
4 SDGPM25 Concentrations of fine particulate matter (PM2.5) AMR Americas Country BRA Brazil Year 2019 True Residence Area Type Towns RESIDENCEAREATYPE_TOWN 10.09 8.23 12.46 10.09 [8.23-12.46]
In [12]:
#we can see the bottom 5 rows
global_data.tail()
Out[12]:
indicatorcode indicator parentlocationcode parentlocation locationtype spatialdimvaluecode location periodtype period islatestyear dim1type dim1 dim1valuecode factvaluenumeric factvaluenumericlow factvaluenumerichigh value
9445 SDGPM25 Concentrations of fine particulate matter (PM2.5) AMR Americas Country BLZ Belize Year 2010 False Residence Area Type Cities RESIDENCEAREATYPE_CITY 9.92 3.91 20.28 9.92 [3.91-20.28]
9446 SDGPM25 Concentrations of fine particulate matter (PM2.5) AMR Americas Country TTO Trinidad and Tobago Year 2010 False Residence Area Type Cities RESIDENCEAREATYPE_CITY 9.92 7.80 12.89 9.92 [7.80-12.89]
9447 SDGPM25 Concentrations of fine particulate matter (PM2.5) AFR Africa Country KEN Kenya Year 2010 False Residence Area Type Cities RESIDENCEAREATYPE_CITY 9.94 6.30 13.57 9.94 [6.30-13.57]
9448 SDGPM25 Concentrations of fine particulate matter (PM2.5) AMR Americas Country USA United States of America Year 2010 False Residence Area Type Cities RESIDENCEAREATYPE_CITY 9.95 9.78 10.11 9.95 [9.78-10.11]
9449 SDGPM25 Concentrations of fine particulate matter (PM2.5) EMR Eastern Mediterranean Country AFG Afghanistan Year 2010 False Residence Area Type Cities RESIDENCEAREATYPE_CITY 92.79 66.17 128.40 92.79 [66.17-128.44]
In [13]:
#it will show the number of columns and rows in the dataset
global_data.shape
Out[13]:
(9450, 17)
In [14]:
#we can see the column details
global_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9450 entries, 0 to 9449
Data columns (total 17 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   indicatorcode         9450 non-null   object 
 1   indicator             9450 non-null   object 
 2   parentlocationcode    9450 non-null   object 
 3   parentlocation        9450 non-null   object 
 4   locationtype          9450 non-null   object 
 5   spatialdimvaluecode   9450 non-null   object 
 6   location              9450 non-null   object 
 7   periodtype            9450 non-null   object 
 8   period                9450 non-null   object 
 9   islatestyear          9450 non-null   bool   
 10  dim1type              9450 non-null   object 
 11  dim1                  9450 non-null   object 
 12  dim1valuecode         9450 non-null   object 
 13  factvaluenumeric      9450 non-null   float64
 14  factvaluenumericlow   9450 non-null   float64
 15  factvaluenumerichigh  9450 non-null   float64
 16  value                 9450 non-null   object 
dtypes: bool(1), float64(3), object(13)
memory usage: 1.2+ MB


Exploratory Data Analysis

To explore the data, we created a range of plots, including bar plots and line plots, and used the folium and plotly libraries to map and analyze trends in the data. The plots and their results are discussed below.

In [15]:
#Here we can see the unique names of indicators and their frequencies in the dataset
data['Name'].value_counts()
Out[15]:
Name
Nitrogen dioxide (NO2)                                    5922
Fine particles (PM 2.5)                                   5922
Ozone (O3)                                                2115
Asthma emergency departments visits due to Ozone           485
Asthma hospitalizations due to Ozone                       484
Asthma emergency department visits due to PM2.5            480
Annual vehicle miles traveled (cars)                       321
Annual vehicle miles traveled                              321
Annual vehicle miles traveled (trucks)                     321
Respiratory hospitalizations due to PM2.5 (age 20+)        240
Cardiovascular hospitalizations due to PM2.5 (age 40+)     240
Cardiac and respiratory deaths due to Ozone                240
Deaths due to PM2.5                                        240
Outdoor Air Toxics - Benzene                               203
Outdoor Air Toxics - Formaldehyde                          203
Boiler Emissions- Total SO2 Emissions                       96
Boiler Emissions- Total NOx Emissions                       96
Boiler Emissions- Total PM2.5 Emissions                     96
Name: count, dtype: int64
In [16]:
#It will plot a bar graph
#we can see the names of indicators and their frequencies
plt.figure(figsize = (10, 8))
data['Name'].value_counts().plot(kind = 'bar', color = 'lightgreen')
plt.xlabel('Name')
plt.ylabel('Frequency')
plt.xticks(rotation = 45, ha = 'right')
plt.tight_layout()
plt.show()

Discussion of Result:
From the above bar graph, we can see the indicators and their frequencies. Nitrogen dioxide (NO2) (5922), Fine particles (PM 2.5) (5922), and Ozone (O3) (2115) have high frequencies in the dataset, while Boiler Emissions - Total SO2 Emissions (96), Total NOx Emissions (96), and Total PM2.5 Emissions (96) have the lowest frequencies.

In [17]:
#Here we can see the frequencies of the types of geographic areas
data['Geo Type Name'].value_counts()
Out[17]:
Geo Type Name
UHF42       7140
CD          6490
UHF34       3366
Borough      859
Citywide     170
Name: count, dtype: int64
In [18]:
#It will plot a bar graph
#we can see the types of geographic areas and their frequencies
plt.figure(figsize = (10, 6))
data['Geo Type Name'].value_counts().plot(kind = 'bar', color = 'skyblue')
plt.xlabel('Geo Type Name')
plt.ylabel('Frequency')
plt.xticks(rotation = 45, ha = 'right')
plt.tight_layout()
plt.show()

Discussion of Result:
From the above graph, we can see the types of geographic areas and their frequencies. UHF42 (7140) and CD (6490) have high frequencies, while Citywide (170) has the lowest frequency in the dataset.

In [19]:
#Here we are selecting the data of the indicator named Fine particles (PM 2.5)
#.copy() avoids pandas' SettingWithCopyWarning when we modify this subset below
pm2_5_data = data[data['Name'] == 'Fine particles (PM 2.5)'].copy()
In [20]:
#Here we are changing the data type to datetime
pm2_5_data['Start_Date'] = pd.to_datetime(pm2_5_data['Start_Date'])
In [21]:
#It will plot PM 2.5 concentration over time
#on x-axis we can see time and on y-axis we can see PM 2.5 concentration
plt.figure(figsize = (10, 6))
plt.plot(pm2_5_data['Start_Date'], pm2_5_data['Data Value'], marker = 'o', linestyle = '', color = 'orange')
plt.title('Trend of PM2.5 Concentration Over Time')
plt.xlabel('Time')
plt.ylabel('PM2.5 Concentration (mcg/m3)')
plt.xticks(rotation = 45)
plt.grid(True)
plt.tight_layout()
plt.show()

Discussion of Result:
From the above graph, we can see the trend of PM 2.5 concentrations over time. Some dates show high PM 2.5 concentrations while others show lower ones, so the series fluctuates up and down over time.

In [22]:
#it shows a line plot
#we can see new york city air quality trends over time
plt.figure(figsize = (12, 6))
sns.lineplot(data = data, x = "Start_Date", y = "Data Value", hue = "Name", style = "Name", markers = True, dashes = False)
plt.title("New York City Air Quality Trends Over Time")
plt.xlabel("Date")
plt.ylabel("Data Value")
plt.xticks(rotation = 90)
plt.legend(title = "Indicator", bbox_to_anchor = (1, 1), loc = 'upper left')
plt.tight_layout()
plt.show()

Discussion of Result:
From the above graph, we can see the New York City air quality trends over time for the different pollutants. The PM 2.5 series, shown in orange, runs from 2008 to 2022 and shows minor ups and downs in concentration over that period.

In [23]:
#it will plot a boxplot graph
#it shows distribution of air quality across countries
plt.figure(figsize = (12, 6))
sns.boxplot(data = global_data, x = "parentlocation", y = "factvaluenumeric")
plt.title("Distribution of Air Quality Indicator Across Countries")
plt.xlabel("Country")
plt.ylabel("Data Value")
plt.xticks(rotation = 45)
plt.tight_layout()
plt.show()

Discussion of Result:
From the above graph, we can see the distribution of air quality values across the parent locations of countries: Africa, the Americas, Europe, the Western Pacific, South-East Asia, and the Eastern Mediterranean. The Eastern Mediterranean countries appear to have the highest values.

In [24]:
#it plots a barplot
#it shows a comparison of PM 2.5 concentrations across major urban areas
plt.figure(figsize = (14, 8))
#ci = None suppresses the error bars (newer seaborn versions use errorbar = None instead)
sns.barplot(data = data[data["Name"] == "Fine particles (PM 2.5)"], x = "Geo Place Name", y = "Data Value", ci = None)
plt.title("Comparison of PM 2.5 concentrations across major Urban Areas")
plt.xlabel("Urban Area")
plt.ylabel("Mean PM2.5 Concentration")
plt.xticks(rotation = 90)
plt.tight_layout()
plt.show()

Discussion of Result:
From the above graph, we can see the comparison of PM 2.5 concentrations across major urban areas. The values vary considerably from one area to the next, meaning the neighborhoods experience different PM 2.5 concentrations.

In [25]:
#it shows a line plot
#we can see global air quality trends over time
plt.figure(figsize = (12, 6))
sns.lineplot(data = global_data, x = "period", y = "factvaluenumeric", hue = "parentlocation", style = "parentlocation", markers = True, dashes = False)
plt.title("Global Air Quality Trends over time")
plt.xlabel("Year")
plt.ylabel("Data Value")
plt.xticks(rotation = 45)
plt.legend(title = "Parent Location", bbox_to_anchor = (1, 1), loc = 'upper left')
plt.tight_layout()
plt.show()

Discussion of Result:
From the above graph, we can see the global air quality trends over time; the data runs from 2010 to 2019. The Eastern Mediterranean has the highest PM 2.5 concentration values and the Americas the lowest. Looking closely, the line for the Americas changes very little over the period.

In [26]:
#it shows a bar plot
#it shows maximum air quality indicators across global regions
plt.figure(figsize = (10, 6))
max_values = global_data.groupby("parentlocation")["factvaluenumeric"].max().sort_values(ascending=False)
sns.barplot(x = max_values.values, y = max_values.index, palette = "viridis")
plt.title("Maximum Air Quality Indicators Across Global Regions")
plt.xlabel("Maximum Data Value")
plt.ylabel("Region")
plt.tight_layout()
plt.show()

Discussion of Result:
From the above graph, we can clearly see the maximum air quality values across the parent locations of countries: Africa, the Americas, Europe, the Western Pacific, South-East Asia, and the Eastern Mediterranean. The Eastern Mediterranean countries have the highest value.

In [27]:
#here we fetch the values of countries and their PM2.5 concentration values
PM25_map_data = {
    'Country': global_data['location'],
    'Value': global_data['factvaluenumeric']
}
In [28]:
#we converted the data into dataframe
PM25_map_data = pd.DataFrame(PM25_map_data)
In [29]:
# Create choropleth map
map_data = dict(
        type = 'choropleth',
        locations = PM25_map_data['Country'],
        locationmode = 'country names',
        z = PM25_map_data['Value'],
        text = PM25_map_data['Country'],
        colorscale = 'Blues',
        colorbar = {'title' : 'Value'}
      )
In [30]:
layout = dict(title = 'PM2.5 Concentrations of global data',
              geo = dict(showframe = False, projection = {'type':'mercator'})
             )
In [31]:
choromap = go.Figure(data = [map_data],layout = layout)
In [32]:
choromap

Discussion of Result:
From the above map, we can see the global PM 2.5 concentrations. Hovering the cursor over a country shows its PM 2.5 value; for example, hovering over the United States of America shows a data value of 6.42. The data for every country can be inspected the same way.

In [33]:
#Here we are connecting to the api which has the present (live) info of NYC
conn = http.client.HTTPSConnection("api.ambeedata.com")

#the api key is read from the AMBEE_API_KEY environment variable rather than hardcoded
headers = {
    'x-api-key': os.getenv('AMBEE_API_KEY'),
    'Content-type': "application/json"
    }

conn.request("GET", "/latest/by-lat-lng?lat=40.730610&lng=-73.935242", headers=headers)

res = conn.getresponse()
api_data = res.read()
In [34]:
#here we are decoding the data into utf-8
api_data = api_data.decode('utf-8')
In [35]:
#we can see the data
api_data
Out[35]:
'{"message":"success","stations":[{"CO":0.531,"NO2":27.06,"OZONE":20.102,"PM10":49.87,"PM25":11.196,"SO2":0.564,"city":"New York","countryCode":"US","division":"New York","lat":40.7139,"lng":-74.007,"placeName":"Broadway","postalCode":"10007-0052","state":"New York","updatedAt":"2024-05-07T02:00:00.000Z","AQI":47,"aqiInfo":{"pollutant":"PM2.5","concentration":11.196,"category":"Good"}}]}'
In [36]:
#here we are importing the json library
import json
In [37]:
#we are loading the api string data to convert it into dictionary
api_data = json.loads(api_data)
In [38]:
#here we can see it
api_data
Out[38]:
{'message': 'success',
 'stations': [{'CO': 0.531,
   'NO2': 27.06,
   'OZONE': 20.102,
   'PM10': 49.87,
   'PM25': 11.196,
   'SO2': 0.564,
   'city': 'New York',
   'countryCode': 'US',
   'division': 'New York',
   'lat': 40.7139,
   'lng': -74.007,
   'placeName': 'Broadway',
   'postalCode': '10007-0052',
   'state': 'New York',
   'updatedAt': '2024-05-07T02:00:00.000Z',
   'AQI': 47,
   'aqiInfo': {'pollutant': 'PM2.5',
    'concentration': 11.196,
    'category': 'Good'}}]}
In [39]:
#Here we separated the above api data and took the values of latitude, longitude, and PM 2.5
ny_present_data = {
    'Latitude': [api_data['stations'][0]['lat']],
    'Longitude': [api_data['stations'][0]['lng']],
    'PM2.5': [api_data['stations'][0]['PM25']]
}
In [40]:
#Here we converted into dataframe
ny_present_data = pd.DataFrame(ny_present_data)
In [41]:
#it will create a map centered around NYC
nyc_map = folium.Map(location = [api_data['stations'][0]['lat'], api_data['stations'][0]['lng']], zoom_start = 10)
In [42]:
#here we are adding markers for the data point
for index, row in ny_present_data.iterrows():
    folium.CircleMarker(
        location = [row['Latitude'], row['Longitude']],
        radius = row['PM2.5'],
        color = 'red',
        fill = True,
        fill_color = 'red',
        fill_opacity = 1,
        popup = f"PM2.5: {row['PM2.5']} µg/m³"
    ).add_to(nyc_map)
In [43]:
#we can see the map
nyc_map
Out[43]:
(An interactive folium map of New York City, with a red circle marker sized by the PM2.5 value, renders here.)

Discussion of Result:
From the above map, we can see the current PM 2.5 concentration for New York City. Clicking the red marker over the city shows the PM 2.5 value, which was 11.196 at the time of the run shown above (the value changes over time if the code cells are re-run, the API calls are limited, and the radius of the circle is scaled by the PM2.5 data value).


Data Preparation

We prepare the data by eliminating or imputing missing values, extracting features, and then developing a predictive model.
Steps:

  1. First, we need to check the null values in the New York and global datasets using the isnull() function.
  2. Next, we handle the missing values by elimination or imputation. If the amount of missing data is very small compared to the size of the dataset, we can eliminate it instead of using imputation methods.
  3. After that, we extract features from the dataset to build a predictive model.
  4. Next, we split the data into training and testing sets: 80% of the original data to train the model and 20% to test it.
  5. Now we can build the predictive model, such as the RandomForestRegressor used below.
  6. We check the model's performance using metrics like mean squared error.
  7. Finally, we plot the actual values against the predicted values to check their relationship.
In [44]:
#here we can see the total null values of ny data columns
data.isnull().sum()
Out[44]:
Unique ID             0
Indicator ID          0
Name                  0
Measure               0
Measure Info          0
Geo Type Name         0
Geo Join ID           9
Geo Place Name        9
Time Period           0
Start_Date            0
Data Value            0
Message           18025
dtype: int64

Here, we can see that some columns have missing values: Geo Join ID (9), Geo Place Name (9), and Message (18025).

In [45]:
#Here we are counting the unique values of Message column
data['Message'].value_counts()
Out[45]:
Series([], Name: count, dtype: int64)

Here, we can observe that there are no values in the Message column, so it is better to drop this column from the dataset.

In [46]:
#we are dropping the column named Message
data = data.drop(columns = ['Message'])
In [47]:
#we can see the top 5 rows after dropping
data.head()
Out[47]:
Unique ID Indicator ID Name Measure Measure Info Geo Type Name Geo Join ID Geo Place Name Time Period Start_Date Data Value
0 179772 640 Boiler Emissions- Total SO2 Emissions Number per km2 number UHF42 409.0 Southeast Queens 2015 01/01/2015 0.3
1 179785 640 Boiler Emissions- Total SO2 Emissions Number per km2 number UHF42 209.0 Bensonhurst - Bay Ridge 2015 01/01/2015 1.2
2 178540 365 Fine particles (PM 2.5) Mean mcg/m3 UHF42 209.0 Bensonhurst - Bay Ridge Annual Average 2012 12/01/2011 8.6
3 178561 365 Fine particles (PM 2.5) Mean mcg/m3 UHF42 409.0 Southeast Queens Annual Average 2012 12/01/2011 8.0
4 823217 365 Fine particles (PM 2.5) Mean mcg/m3 UHF42 409.0 Southeast Queens Summer 2022 06/01/2022 6.1
In [48]:
#we can see the total null values of ny data columns
data.isnull().sum()
Out[48]:
Unique ID         0
Indicator ID      0
Name              0
Measure           0
Measure Info      0
Geo Type Name     0
Geo Join ID       9
Geo Place Name    9
Time Period       0
Start_Date        0
Data Value        0
dtype: int64

Here, we can see there are still some missing values in the Geo Join ID and Geo Place Name columns. Both columns have very few missing values compared to the size of the data, so we can drop those rows instead of imputing them, which could introduce false information.

In [49]:
#Here we are dropping rows whose Geo Join ID or Geo Place Name values are null
data.dropna(subset = ['Geo Join ID', 'Geo Place Name'], inplace = True)
In [50]:
#now we can see that there is no missing data in the dataset of ny
data.isnull().sum()
Out[50]:
Unique ID         0
Indicator ID      0
Name              0
Measure           0
Measure Info      0
Geo Type Name     0
Geo Join ID       0
Geo Place Name    0
Time Period       0
Start_Date        0
Data Value        0
dtype: int64
In [51]:
#here we can see the total null values of global data columns
global_data.isnull().sum()
Out[51]:
indicatorcode           0
indicator               0
parentlocationcode      0
parentlocation          0
locationtype            0
spatialdimvaluecode     0
location                0
periodtype              0
period                  0
islatestyear            0
dim1type                0
dim1                    0
dim1valuecode           0
factvaluenumeric        0
factvaluenumericlow     0
factvaluenumerichigh    0
value                   0
dtype: int64

We can see that there is no missing data present in the global data of PM2.5 concentrations.

In [52]:
#Here we are extracting the year from the Time Period column
#values like "Annual Average 2012" or "Summer 2022" cannot be parsed, so errors = 'coerce' turns them into NaT
data['year'] = pd.to_datetime(data['Time Period'], errors = 'coerce').dt.year
In [53]:
#Here we are extracting the month the same way
data['month'] = pd.to_datetime(data['Time Period'], errors = 'coerce').dt.month
In [54]:
#we are dropping the rows whose Time Period could not be parsed, before building the predictive model
#note that this keeps only the purely numeric time periods, which shrinks the dataset considerably
data.dropna(inplace = True)
In [55]:
#we are splitting the data into independent and dependent variables
X = data[['Indicator ID', 'Geo Type Name', 'Geo Place Name', 'year', 'month']]
y = data['Data Value']
In [56]:
#Here we are encoding the categorical variables
X = pd.get_dummies(X)
In [57]:
#we are splitting the data into 80% training and 20% testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)
In [58]:
#making a model of RandomForest
model = RandomForestRegressor()

Here, we have used a RandomForestRegressor model. It is a powerful and flexible model that is widely used for regression tasks because of its ability to handle complex datasets, its robustness, and its scalability.
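As a quick robustness check beyond the single train/test split (not part of the original run), k-fold cross-validation could be sketched as follows:

#5-fold cross-validation: the average MSE across folds gives a more stable estimate
from sklearn.model_selection import cross_val_score
cv_scores = cross_val_score(RandomForestRegressor(random_state = 42), X, y,
                            cv = 5, scoring = 'neg_mean_squared_error')
print(-cv_scores.mean())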

In [59]:
#we are training the model
model.fit(X_train, y_train)
Out[59]:
RandomForestRegressor()
In [60]:
#Here we are getting predictions
predictions = model.predict(X_test)
In [61]:
#we are calculating the mean squared error which tells us about the model performance
mse = mean_squared_error(y_test, predictions)
In [62]:
#we can see the mean squared error
mse
Out[62]:
74.57973174397588
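To put this number in context, a short sketch: taking the square root converts the MSE back into the units of Data Value, and a naive baseline that always predicts the training mean gives a floor to compare against.

#RMSE restores the units of Data Value (sqrt(74.58) is roughly 8.6)
rmse = np.sqrt(mse)
print(rmse)

#baseline: always predicting the training-set mean
baseline_mse = mean_squared_error(y_test, np.full_like(y_test, y_train.mean()))
print(baseline_mse)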
In [63]:
#it plots a scatter plot 
#we can see the distribution of actual values and predicted values 
plt.figure(figsize = (8, 6))
plt.scatter(y_test, predictions)
plt.xlabel('Actual')
plt.ylabel('Predicted')
plt.title('Actual vs. Predicted')
plt.show()

Discussion of Result:
From the above graph, we can see the actual values plotted against the predicted values. The points show a visible relationship between the actual and predicted values, indicating that the model captures some of the structure in the data.


Prepped Data Analysis

We can look at the dataset again after the cleaning and feature engineering steps.

In [64]:
#here we can see the top 5 rows
data.head()
Out[64]:
Unique ID Indicator ID Name Measure Measure Info Geo Type Name Geo Join ID Geo Place Name Time Period Start_Date Data Value year month
0 179772 640 Boiler Emissions- Total SO2 Emissions Number per km2 number UHF42 409.0 Southeast Queens 2015 01/01/2015 0.3 2015.0 1.0
1 179785 640 Boiler Emissions- Total SO2 Emissions Number per km2 number UHF42 209.0 Bensonhurst - Bay Ridge 2015 01/01/2015 1.2 2015.0 1.0
16 130413 640 Boiler Emissions- Total SO2 Emissions Number per km2 number UHF42 210.0 Coney Island - Sheepshead Bay 2013 01/01/2013 0.9 2013.0 1.0
17 130412 640 Boiler Emissions- Total SO2 Emissions Number per km2 number UHF42 209.0 Bensonhurst - Bay Ridge 2013 01/01/2013 1.7 2013.0 1.0
18 130434 640 Boiler Emissions- Total SO2 Emissions Number per km2 number UHF42 410.0 Rockaways 2013 01/01/2013 0.0 2013.0 1.0
In [65]:
#we can see the bottom 5 rows
data.tail()
Out[65]:
Unique ID Indicator ID Name Measure Measure Info Geo Type Name Geo Join ID Geo Place Name Time Period Start_Date Data Value year month
18020 816914 643 Annual vehicle miles traveled Million miles per square mile CD 503.0 Tottenville and Great Kills (CD3) 2019 01/01/2019 12.9 2019.0 1.0
18021 816913 643 Annual vehicle miles traveled Million miles per square mile CD 503.0 Tottenville and Great Kills (CD3) 2010 01/01/2010 14.7 2010.0 1.0
18022 816872 643 Annual vehicle miles traveled Million miles per square mile UHF42 208.0 Canarsie - Flatlands 2010 01/01/2010 43.4 2010.0 1.0
18023 816832 643 Annual vehicle miles traveled Million miles per square mile UHF42 407.0 Southwest Queens 2010 01/01/2010 65.8 2010.0 1.0
18024 151658 643 Annual vehicle miles traveled Million miles per square mile UHF42 408.0 Jamaica 2005 01/01/2005 41.0 2005.0 1.0
In [66]:
#here we can see the description of the columns
data.describe()
Out[66]:
Unique ID Indicator ID Geo Join ID Data Value year month
count 1657.000000 1657.000000 1657.000000 1657.000000 1657.000000 1657.0
mean 461609.864816 644.091129 263.517200 32.433736 2011.541340 1.0
std 319311.000973 1.911578 138.287712 42.480830 4.861327 0.0
min 130397.000000 640.000000 1.000000 0.000000 2005.000000 1.0
25% 154472.000000 643.000000 201.000000 1.900000 2005.000000 1.0
50% 315590.000000 644.000000 302.000000 6.100000 2011.000000 1.0
75% 816928.000000 645.000000 403.000000 56.600000 2015.000000 1.0
max 817342.000000 647.000000 504.000000 284.700000 2019.000000 1.0
In [67]:
#here we can see the info of the data
data.info()
<class 'pandas.core.frame.DataFrame'>
Index: 1657 entries, 0 to 18024
Data columns (total 13 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Unique ID       1657 non-null   int64  
 1   Indicator ID    1657 non-null   int64  
 2   Name            1657 non-null   object 
 3   Measure         1657 non-null   object 
 4   Measure Info    1657 non-null   object 
 5   Geo Type Name   1657 non-null   object 
 6   Geo Join ID     1657 non-null   float64
 7   Geo Place Name  1657 non-null   object 
 8   Time Period     1657 non-null   object 
 9   Start_Date      1657 non-null   object 
 10  Data Value      1657 non-null   float64
 11  year            1657 non-null   float64
 12  month           1657 non-null   float64
dtypes: float64(4), int64(2), object(7)
memory usage: 181.2+ KB
In [68]:
#it plots a scatter plot
#we can see the scatter plot of Indicator ID and Data value
plt.figure(figsize = (12, 6))
plt.scatter(data['Indicator ID'], data['Data Value'])
plt.xlabel('Indicator ID')
plt.ylabel('Data Value')
plt.title('Scatter Plot of Indicator ID vs. Data Value')
plt.show()
In [69]:
#it plots a scatter plot
#we can see the data values over changing years
plt.scatter(data['year'], data['Data Value'])
plt.xlabel('Years')
plt.ylabel('Data Value')
plt.title('Scatter Plot of Years vs. Data Value')
plt.show()


Investigative Analysis and Results

We have also evaluated predictive models other than the RandomForestRegressor. The other models give higher mean squared errors when compared with the RandomForestRegressor's mean squared error.

In [70]:
#we are splitting the data into independent and dependent variables
X = data[['Indicator ID', 'Geo Type Name', 'Geo Place Name', 'year', 'month']]
y = data['Data Value']
In [71]:
#Here we are encoding the categorical variables
X = pd.get_dummies(X)
In [72]:
#we are splitting the data into 80% training and 20% testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
In [73]:
#making a model of Linear Regression
model = LinearRegression()
In [74]:
#we are training the model
model.fit(X_train, y_train)
Out[74]:
LinearRegression()
In [75]:
#Here we are getting predictions
predictions = model.predict(X_test)
In [76]:
#we are calculating the mean squared error which tells us about the model performance
mse = mean_squared_error(y_test, predictions)
In [77]:
#we can see the mean squared error
mse
Out[77]:
1337.7076638095348

Here, we can see the mean squared error from the Linear Regression model, which is much higher, meaning the model did not perform well on this dataset. The RandomForestRegressor gives a better and more acceptable result compared to Linear Regression.

In [78]:
#it plots a scatter plot 
#we can see the distribution of actual values and predicted values 
plt.figure(figsize = (8, 6))
plt.scatter(y_test, predictions)
plt.xlabel('Actual')
plt.ylabel('Predicted')
plt.title('Actual vs. Predicted')
plt.show()

Discussion of Result:
From the above graph, we can see the actual values plotted against the predicted values. We cannot identify any clear relationship between the actual and the predicted values, so the Linear Regression model did not perform well on the dataset.

In [79]:
#we are splitting the data into independent and dependent variables
X = data[['Indicator ID', 'Geo Type Name', 'Geo Place Name', 'year', 'month']]
y = data['Data Value']
In [80]:
#Here we are encoding the categorical variables
X = pd.get_dummies(X)
In [81]:
#we are splitting the data into 80% training and 20% testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
In [82]:
#making a model of Support Vector Regression
model = SVR(kernel='rbf')
In [83]:
#we are training the model
model.fit(X_train, y_train)
Out[83]:
SVR()
In [84]:
#Here we are getting predictions
predictions = model.predict(X_test)
In [85]:
#we are calculating the mean squared error which tells us about the model performance
mse = mean_squared_error(y_test, predictions)
In [86]:
#we can see the mean squared error
mse
Out[86]:
2094.799139328058

Here, we can see the mean squared error from the Support Vector Regression model, which is much higher, meaning the model did not perform well on this dataset. The RandomForestRegressor gives a better and more acceptable result compared to Support Vector Regression.

In [87]:
#it plots a scatter plot 
#we can see the distribution of actual values and predicted values 
plt.figure(figsize = (8, 6))
plt.scatter(y_test, predictions)
plt.xlabel('Actual')
plt.ylabel('Predicted')
plt.title('Actual vs. Predicted')
plt.show()

Discussion of Result:
From the above graph, we can see the actual values plotted against the predicted values. We cannot identify any clear relationship between the actual and the predicted values, so the Support Vector Regression model did not perform well on the dataset.


Conclusion

In conclusion, this project aimed to analyze air quality data to understand temporal and spatial variations in air pollution levels. We started by preprocessing the data, which included converting data types, handling missing values, and performing feature engineering to develop relevant features for modeling.
After that, we developed predictive models using machine learning algorithms like the Random Forest Regressor to predict air pollution levels based on factors such as geographic location, time period, and other indicators. The model performed well, with a reasonable mean squared error that indicates real predictive capability.
In the Exploratory Data Analysis, we presented the details of both the New York data and the global data. The New York data contains many pollutants; from these we chose PM 2.5 because it carries many health risks, such as lung cancer and respiratory issues. We also saw how PM 2.5 values change over time, lower on some dates and higher on others, and examined PM 2.5 values across the parent locations of different countries.
We also plotted maps using the folium and plotly libraries: one showing New York City's present data, which changes over time (re-running the code cell fetches an updated value), and another showing the global data, with the parent locations of different countries and their PM 2.5 values visible by hovering the cursor over a location of interest.
Overall, while we were able to address some of the research questions posed in the proposal, there is still room for improvement and further investigation. Future extensions of this work could include incorporating additional data sources, such as meteorological data, traffic patterns, and land use data, to enhance the predictive models' accuracy and robustness. Additionally, exploring more advanced machine learning algorithms and ensemble techniques could lead to better predictions and a deeper understanding of air quality dynamics.

In [ ]: